Day 3 - Introduction to Data Analysis with R
Freie Universität Berlin - Theoretical Ecology
October 11, 2024
Let’s look at some examples
Tidy
| id | name | color |
|---|---|---|
| 1 | floof | gray |
| 2 | max | black |
| 3 | cat | orange |
| 4 | donut | gray |
| 5 | merlin | black |
| 6 | panda | calico |
Non-tidy
| floof | max | cat | donut | merlin | panda |
|---|---|---|---|---|---|
| gray | black | orange | gray | black | calico |

| gray | black | orange | calico |
|---|---|---|---|
| floof | max | cat | panda |
| donut | merlin |  |  |
Sometimes raw data is non-tidy because its structure is optimized for data entry or viewing rather than analysis.
The main advantage of tidy data is that the tidyverse packages are built to work with it.
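To illustrate why the tidy layout is convenient, here is a minimal sketch that rebuilds the tidy cat table from above and counts cats per color. The object names `cats` and `color_counts` are chosen for this example only, and the use of dplyr's `count()` is an assumption about what a typical next step would be.

```r
library(tibble)
library(dplyr)

# The tidy cat table: one row per cat, one column per variable
cats <- tibble(
  id = 1:6,
  name = c("floof", "max", "cat", "donut", "merlin", "panda"),
  color = c("gray", "black", "orange", "gray", "black", "calico")
)

# Because `color` is a single column, counting cats per color is one call
color_counts <- count(cats, color)
color_counts
```

With either of the non-tidy layouts, the same count would first require reshaping the table.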
Let’s go back to the city data set from earlier:
cities_tbl
#> # A tibble: 10 × 4
#> city population area_km2 country
#> <chr> <dbl> <dbl> <chr>
#> 1 Istanbul 15100000 2576 Turkey
#> 2 Moscow 12500000 2561 Russia
#> 3 London 9000000 1572 UK
#> 4 Saint Petersburg 5400000 1439 Russia
#> 5 Berlin 3800000 891 Germany
#> 6 Madrid 3200000 604 Spain
#> 7 Kyiv 3000000 839 Ukraine
#> 8 Rome 2800000 1285 Italy
#> 9 Bucharest 2200000 228 Romania
#> 10 Paris 2100000 105 France
This already looks pretty tidy.
cities_untidy
#> # A tibble: 2 × 11
#> type Turkey_Istanbul Russia_Moscow UK_London `Russia_Saint Petersburg`
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 population 15100000 12500000 9000000 5400000
#> 2 area_km2 2576 2561 1572 1439
#> # ℹ 6 more variables: Germany_Berlin <dbl>, Spain_Madrid <dbl>,
#> # Ukraine_Kyiv <dbl>, Italy_Rome <dbl>, Romania_Bucharest <dbl>,
#> # France_Paris <dbl>
What’s not tidy here?
Let’s tidy this data using functions from the tidyr package!
pivot_longer()
One variable that is split into multiple columns can be gathered into a single column with pivot_longer().
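A minimal sketch of this step, assuming a reduced version of `cities_untidy` with only the first three cities so that the example is self-contained. The column names `location` and `value` match the output shown in these slides; selecting the columns with a range (`Turkey_Istanbul:UK_London`) is one possible choice, not necessarily the one used originally.

```r
library(tibble)
library(tidyr)

# Reduced copy of cities_untidy (first three cities only, values from the slides)
cities_untidy <- tibble(
  type = c("population", "area_km2"),
  Turkey_Istanbul = c(15100000, 2576),
  Russia_Moscow = c(12500000, 2561),
  UK_London = c(9000000, 1572)
)

step1 <- pivot_longer(
  cities_untidy,
  cols = Turkey_Istanbul:UK_London, # columns whose names are really values
  names_to = "location",            # new column that holds the old column names
  values_to = "value"               # new column that holds the cell values
)
step1
```

Each former column name now appears as a value in `location`, once for every `type`.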
step1
#> # A tibble: 20 × 3
#> type location value
#> <chr> <chr> <dbl>
#> 1 population Turkey_Istanbul 15100000
#> 2 population Russia_Moscow 12500000
#> 3 population UK_London 9000000
#> 4 population Russia_Saint Petersburg 5400000
#> 5 population Germany_Berlin 3800000
#> 6 population Spain_Madrid 3200000
#> 7 population Ukraine_Kyiv 3000000
#> 8 population Italy_Rome 2800000
#> 9 population Romania_Bucharest 2200000
#> 10 population France_Paris 2100000
#> 11 area_km2 Turkey_Istanbul 2576
#> 12 area_km2 Russia_Moscow 2561
#> 13 area_km2 UK_London 1572
#> 14 area_km2 Russia_Saint Petersburg 1439
#> 15 area_km2 Germany_Berlin 891
#> 16 area_km2 Spain_Madrid 604
#> 17 area_km2 Ukraine_Kyiv 839
#> 18 area_km2 Italy_Rome 1285
#> 19 area_km2 Romania_Bucharest 228
#> 20 area_km2 France_Paris 105
Another way to select the columns to pivot:
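One such alternative, sketched under the assumption that every column except `type` should be pivoted, is negative selection with `!type`. The reduced `cities_untidy` and the names `step1_range` and `step1_alt` are illustrative only.

```r
library(tibble)
library(tidyr)

# Reduced copy of cities_untidy (first three cities only)
cities_untidy <- tibble(
  type = c("population", "area_km2"),
  Turkey_Istanbul = c(15100000, 2576),
  Russia_Moscow = c(12500000, 2561),
  UK_London = c(9000000, 1572)
)

# Select by range: from the first to the last city column
step1_range <- pivot_longer(
  cities_untidy,
  cols = Turkey_Istanbul:UK_London,
  names_to = "location",
  values_to = "value"
)

# Select by exclusion: everything except `type`
step1_alt <- pivot_longer(
  cities_untidy,
  cols = !type,
  names_to = "location",
  values_to = "value"
)
```

Excluding the columns that should stay is often shorter when many columns have to be pivoted, and it keeps working if more city columns are added later.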
separate_wider_delim()
Multiple variable values that are united into one column can be separated with separate_wider_delim().
#> # A tibble: 20 × 3
#> type location value
#> <chr> <chr> <dbl>
#> 1 population Turkey_Istanbul 15100000
#> 2 population Russia_Moscow 12500000
#> # ℹ 18 more rows
#> # A tibble: 20 × 4
#> type country city value
#> <chr> <chr> <chr> <dbl>
#> 1 population Turkey Istanbul 15100000
#> 2 population Russia Moscow 12500000
#> 3 population UK London 9000000
#> 4 population Russia Saint Petersburg 5400000
#> 5 population Germany Berlin 3800000
#> # ℹ 15 more rows
The opposite function is called unite. Check out ?unite for details.
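A minimal sketch of the separation and its reversal, assuming a four-row excerpt of `step1` and the `_` delimiter visible in the `location` values. `separate_wider_delim()` requires tidyr 1.3.0 or later; the names `step2` and `back` are chosen for this example.

```r
library(tibble)
library(tidyr)

# Four-row excerpt of step1 (values taken from the output above)
step1 <- tibble(
  type = c("population", "population", "area_km2", "area_km2"),
  location = c("Turkey_Istanbul", "Russia_Saint Petersburg",
               "Turkey_Istanbul", "Russia_Saint Petersburg"),
  value = c(15100000, 5400000, 2576, 1439)
)

# Split `location` at the underscore into two new columns
step2 <- separate_wider_delim(
  step1,
  cols = location,
  delim = "_",
  names = c("country", "city")
)
step2

# unite() pastes the two columns back together
back <- unite(step2, col = "location", country, city, sep = "_")
back
```

Because only the underscore is used as the delimiter, the space in "Saint Petersburg" is left untouched.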
pivot_wider()
One observation that is split into multiple rows can be combined into a single row with pivot_wider().
#> # A tibble: 20 × 4
#> type country city value
#> <chr> <chr> <chr> <dbl>
#> 1 population Turkey Istanbul 15100000
#> 2 population Russia Moscow 12500000
#> # ℹ 18 more rows
#> # A tibble: 10 × 4
#> country city population area_km2
#> <chr> <chr> <dbl> <dbl>
#> 1 Turkey Istanbul 15100000 2576
#> 2 Russia Moscow 12500000 2561
#> 3 UK London 9000000 1572
#> 4 Russia Saint Petersburg 5400000 1439
#> 5 Germany Berlin 3800000 891
#> # ℹ 5 more rows
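The widening step shown above could look roughly like the following sketch. It assumes a four-row excerpt of the separated data, here called `step2`, and the result name `cities_tidy`; both names are illustrative.

```r
library(tibble)
library(tidyr)

# Four-row excerpt of the separated data (values taken from the output above)
step2 <- tibble(
  type = c("population", "population", "area_km2", "area_km2"),
  country = c("Turkey", "Russia", "Turkey", "Russia"),
  city = c("Istanbul", "Moscow", "Istanbul", "Moscow"),
  value = c(15100000, 12500000, 2576, 2561)
)

cities_tidy <- pivot_wider(
  step2,
  names_from = type,   # values of `type` become the new column names
  values_from = value  # values of `value` fill the new columns
)
cities_tidy
```

Each city now occupies exactly one row, with `population` and `area_km2` as separate columns.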
We can also use a pipe to do all these steps in one:
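A sketch of such a pipeline, assuming the native pipe `|>` (R 4.1 or later; the magrittr pipe `%>%` would work the same way) and a reduced `cities_untidy` with three cities. The argument choices repeat the assumptions made for the individual steps.

```r
library(tibble)
library(tidyr)

# Reduced copy of cities_untidy (first three cities only)
cities_untidy <- tibble(
  type = c("population", "area_km2"),
  Turkey_Istanbul = c(15100000, 2576),
  Russia_Moscow = c(12500000, 2561),
  UK_London = c(9000000, 1572)
)

cities_tidy <- cities_untidy |>
  # 1. gather the city columns into rows
  pivot_longer(cols = !type, names_to = "location", values_to = "value") |>
  # 2. split "Country_City" into two columns
  separate_wider_delim(cols = location, delim = "_", names = c("country", "city")) |>
  # 3. spread population and area_km2 into their own columns
  pivot_wider(names_from = type, values_from = value)

cities_tidy
```

The pipe avoids the intermediate objects, at the cost of not being able to inspect each step separately.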
drop_na()
Drop rows with missing values:
This is an easier and more intuitive alternative to filter(!is.na(...)).
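A minimal sketch comparing both approaches. The table `cities_na` is hypothetical: the missing values are introduced here purely for illustration and are not part of the original city data.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical excerpt with artificially inserted missing values
cities_na <- tibble(
  city = c("Berlin", "Madrid", "Kyiv"),
  population = c(3800000, NA, 3000000),
  area_km2 = c(891, 604, NA)
)

# Drop every row that has a missing value in any column
complete_rows <- drop_na(cities_na)

# Drop only rows where `population` is missing
with_population <- drop_na(cities_na, population)

# The same result written with filter()
with_population_filter <- filter(cities_na, !is.na(population))
```

Without column arguments, `drop_na()` checks all columns, which would need one `is.na()` condition per column when written with `filter()`.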
Task (30 min)
Tidy data with tidyr
Find the task description here
Selina Baldauf // Tidy data with tidyr